Introduction

The dataset contains information on purchases made through the purchase card programs administered by the state and higher-education institutions. All transactions included in this dataset were online transactions. In this report, we summarise the observations and trends within the dataset and create features that can aid in building a model to identify fraudulent transactions, or anomalies.

#Loading required libraries and data for analysis

library(DT) #For easy viewing and interaction of dataframes
library(dplyr) #For data manipulation and wrangling
library(ggplot2) #For visualizations and comprehensive plotting
library(plotly) #For interactive and beautiful graphs
library(gridExtra); library(cowplot) #For better graph grids

ccd <- read.csv("purchase_credit_card.csv")
str(ccd) #To understand the data
## 'data.frame':    442458 obs. of  11 variables:
##  $ Year.Month                  : int  201307 201307 201307 201307 201307 201307 201307 201307 201307 201307 ...
##  $ Agency.Number               : int  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
##  $ Agency.Name                 : chr  "OKLAHOMA STATE UNIVERSITY" "OKLAHOMA STATE UNIVERSITY" "OKLAHOMA STATE UNIVERSITY" "OKLAHOMA STATE UNIVERSITY" ...
##  $ Cardholder.Last.Name        : chr  "Mason" "Mason" "Massey" "Massey" ...
##  $ Cardholder.First.Initial    : chr  "C" "C" "J" "T" ...
##  $ Description                 : chr  "GENERAL PURCHASE" "ROOM CHARGES" "GENERAL PURCHASE" "GENERAL PURCHASE" ...
##  $ Amount                      : num  890 369 165.8 96.4 126 ...
##  $ Vendor                      : chr  "NACAS" "SHERATON HOTEL" "SEARS.COM 9300" "WAL-MART #0137" ...
##  $ Transaction.Date            : chr  "07/30/2013 12:00:00 AM" "07/30/2013 12:00:00 AM" "07/29/2013 12:00:00 AM" "07/30/2013 12:00:00 AM" ...
##  $ Posted.Date                 : chr  "07/31/2013 12:00:00 AM" "07/31/2013 12:00:00 AM" "07/31/2013 12:00:00 AM" "07/31/2013 12:00:00 AM" ...
##  $ Merchant.Category.Code..MCC.: chr  "CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS" "SHERATON" "DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE CLASSIFIED" "GROCERY STORES,AND SUPERMARKETS" ...

Purchase Card data

The data contains 442,458 observations, or transactions. For each transaction, we have 11 different variables providing information about the transaction, as shown in the str() output above.


In the next step, we change the variable names to more standard R names, which makes it easier to code with the variables. We also fix datatypes for the following columns to enable easier data wrangling: year_month, agency_number, vendor and MCC are converted into factors, and transaction_date and posted_date are converted into Date format.

colnames(ccd) <- c("year_month", "agency_number", "agency_name", "ch_last_name",
                   "ch_first_initial", "description", 'amount', 'vendor', 
                   'transaction_date', 'posted_date', 'MCC')
#fixed colnames into more standard R formats for easier syntax

#Fixing data types
ccd$year_month <- as.factor(ccd$year_month) #Converting year month into a factor to enable better graphs
ccd$agency_number <- as.factor(ccd$agency_number) #Converting agency number into a factor to enable better EDA 
ccd$vendor <- as.factor(ccd$vendor) #Converting vendor into a factor to enable better EDA
ccd$MCC <- as.factor(ccd$MCC) #Converting MCC into a factor to enable better EDA.
ccd$transaction_date <- as.Date(ccd$transaction_date, format = "%m/%d/%Y %I:%M:%S %p")
ccd$posted_date <- as.Date(ccd$posted_date, format = "%m/%d/%Y %I:%M:%S %p")
#Date objects are easier to work with in R than raw character timestamps

#These changes allow us to identify the following:
unique(ccd$year_month)
##  [1] 201307 201308 201309 201310 201311 201312 201401 201402 201403 201404
## [11] 201405 201406
## 12 Levels: 201307 201308 201309 201310 201311 201312 201401 201402 ... 201406
#The data includes transactions from a 12-month period from July 2013 to June 2014.
sprintf("Number of unique agencies: %s", length(unique(ccd$agency_number)))
## [1] "Number of unique agencies: 116"
#116 unique agencies
sprintf("Number of unique vendors: %s", length(unique(ccd$vendor)))
## [1] "Number of unique vendors: 86729"
sprintf("Number of unique Merchant Categories: %s", length(unique(ccd$MCC)))
## [1] "Number of unique Merchant Categories: 435"
#86729 unique vendors who fall into 435 unique categories

As can be seen, the data covers a 12-month period from July 2013 to June 2014. Transactions were made by 116 unique agencies, across 86,729 unique vendors who fall into 435 unique merchant categories.


Creating features for use in modelling

As mentioned before, we will use the agency as the primary unit for each transaction. This means we want to identify transaction patterns for each agency and use them to identify anomalies. Let us begin by looking at summary statistics across agencies.

#creating summary statistics for each agency
stat_agency <- ccd %>% group_by(agency_name) %>% 
  dplyr::summarise(count_agency_all = n(), total_amount_agency_all = sum(amount),
            avg_amount_agency_all = mean(amount),
            min_amount_agency_all = min(amount), 
            max_amount_agency_all = max(amount)) %>% 
  arrange(desc(total_amount_agency_all)) %>% ungroup()
summary(stat_agency)
##  agency_name        count_agency_all   total_amount_agency_all
##  Length:124         Min.   :     1.0   Min.   :      36       
##  Class :character   1st Qu.:   130.5   1st Qu.:   24948       
##  Mode  :character   Median :   383.0   Median :  125616       
##                     Mean   :  3568.2   Mean   : 1516458       
##                     3rd Qu.:  1360.2   3rd Qu.:  424601       
##                     Max.   :115995.0   Max.   :33778840       
##  avg_amount_agency_all min_amount_agency_all max_amount_agency_all
##  Min.   :    22.02     Min.   :-42863.0      Min.   :     35.6    
##  1st Qu.:   192.49     1st Qu.: -2250.0      1st Qu.:   1891.0    
##  Median :   264.74     Median :  -812.8      Median :   4661.5    
##  Mean   :  1705.26     Mean   : -3113.7      Mean   :  43649.1    
##  3rd Qu.:   362.34     3rd Qu.:  -211.3      3rd Qu.:  12366.4    
##  Max.   :171619.61     Max.   :   437.2      Max.   :1903858.4
datatable(stat_agency, width = 300)
#plotting avg_amount vs count for each agency. The bubble size represents total amount.
plot9 <- stat_agency %>% filter(avg_amount_agency_all < 6000) %>% 
  ggplot(aes(x = avg_amount_agency_all, y = count_agency_all,
             size = total_amount_agency_all,
             fill = agency_name, color = agency_name)) +
  geom_jitter() + theme_minimal() + 
  xlab("Average Transaction Amount") + ylab("Number of Transactions") +
  theme(legend.position = "none")
ggplotly(plot9)

The number of transactions by each agency ranges from 1 to 115,995, with a mean of 3,568 transactions. Similarly, the average value of a transaction ranges from 22 to 171,620, although the mean and median are comparatively low at 1,705 and 265. The highest average transaction value is about 11 standard deviations away from the mean. Therefore, for our visual examination, we will limit the average amount value to 6,000.
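The "standard deviations from the mean" figure comes from a standardized-distance computation, which can be sketched on toy numbers (illustrative values only, not the real avg_amount_agency_all column):

```r
# Toy per-agency average amounts with one extreme value
x <- c(50, 200, 260, 400, 180000)

# How many standard deviations the maximum lies above the mean
z <- (max(x) - mean(x)) / sd(x)
z
```

Applied to the real avg_amount_agency_all column, the same expression yields the distance quoted above.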

As can be seen, Oklahoma State University is an outlier in terms of both the number of transactions and the total transaction amount, while the other agencies are clustered together.

#After grouping by agency, we create new variables to identify the last
#transaction date, the gap between each transaction and the last transaction,
#and the lag between two subsequent transactions.
stat_agency2 <- ccd %>% group_by(agency_name) %>% 
  arrange(agency_name, transaction_date) %>% 
  mutate(last_transaction = max(transaction_date)) %>%
  mutate(gap_transaction = last_transaction - transaction_date) %>% 
  mutate(lag_transaction = transaction_date - lag(transaction_date)) %>% 
  arrange(gap_transaction) %>% ungroup()
datatable(head(stat_agency2), width = 300)
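The gap and lag computations above can be illustrated on toy dates (base R only; the real columns are difftime values in days):

```r
# Toy dates standing in for one agency's transaction history
d <- as.Date(c("2014-06-01", "2014-06-10", "2014-06-30"))

# Gap between each transaction and the most recent one (difftime in days)
gap <- max(d) - d

# Lag since the previous transaction; NA for the first
lag_days <- c(NA, diff(d))

as.numeric(gap)   # 29 20 0
lag_days          # NA  9 20
```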

Considering the agency as the primary unit, we will use the RFM method (Recency, Frequency and Monetary value) to create features. For all features we create in this category, we will consider three time periods: 1 day, 1 month and 3 months.

In a real-time system, the transaction being analyzed would be the agency's most recent transaction.


RFM based Features:

  • Count: number of transactions by each agency in the given time period
  • Total_amount: sum of the value across all transactions by each agency in the given time period
  • Average_amount: average value of a transaction by each agency in the given time period
  • Min_amount: minimum value of a transaction by each agency in the given time period
  • Max_amount: maximum value of a transaction by each agency in the given time period
  • Avg_lag_tnx: average lag (in difftime) between two transactions by an agency in the given time period
  • Min_lag_tnx: minimum lag (in difftime) between two transactions by an agency in the given time period

By using counts across 1 day, 1 month, 3 months and all time, we can identify sudden increases or decreases in the number of transactions; both can point towards anomalous transactions. Combined with the lag between transactions, we can identify anomalous behaviour in transaction frequency. Since these are institutional purchasers, and given the fairly constant number of transactions each month, we can expect the lag not to vary greatly between transactions.
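A minimal sketch of such a count-based check, using toy numbers and hypothetical column names in the spirit of the features below:

```r
# Toy per-agency aggregates (hypothetical values)
agg <- data.frame(
  agency_name      = c("A", "B"),
  count_agency_day = c(40, 2),
  count_agency_3m  = c(900, 180)
)

# Average daily transaction count over the ~90-day window
agg$daily_rate_3m <- agg$count_agency_3m / 90

# Ratio of today's count to the historical daily rate;
# values well above 1 indicate a sudden burst of activity
agg$velocity_ratio <- agg$count_agency_day / agg$daily_rate_3m
agg$velocity_ratio  # 4 1
```

Here agency A made 40 transactions in one day against a historical rate of 10 per day, a fourfold burst, while agency B is in line with its history.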

For the value of each transaction, we create four features (total, average, minimum and maximum amount) across the three time periods. By looking at the average, minimum and maximum amounts, we can flag anomalous behavior when a transaction value is too low or too high.

At this level of aggregation (agency), we create 21 features in total (7 variables across 3 time periods).

#creating aggregate variables at the agency level and 1 day time period
stat_agency_day <- stat_agency2 %>% filter(gap_transaction == 0) %>% group_by(agency_name) %>% 
  dplyr::summarise(count_agency_day = n(),
            total_amount_agency_day = sum(amount), 
            average_amount_agency_day = mean(amount),
            min_amount_agency_day = min(amount),
            max_amount_agency_day = max(amount),
            avg_lag_tnx_day = mean(lag_transaction),
            min_lag_tnx_day = min(lag_transaction)) %>% ungroup()

#creating aggregate variables at the agency level and 1-month time period
stat_agency_m <- stat_agency2 %>% filter(gap_transaction < 31) %>% group_by(agency_name) %>% 
  dplyr::summarise(count_agency_m = n(),
                   total_amount_agency_m = sum(amount), 
                   average_amount_agency_m = mean(amount),
                   min_amount_agency_m = min(amount),
                   max_amount_agency_m = max(amount),
                   avg_lag_tnx_m = mean(lag_transaction),
                   min_lag_tnx_m = min(lag_transaction)) %>% ungroup()

#creating aggregate variables at the agency level and 3-month time period
stat_agency_3m <- stat_agency2 %>% filter(gap_transaction < 91) %>% group_by(agency_name) %>% 
  dplyr::summarise(count_agency_3m = n(),
                   total_amount_agency_3m = sum(amount), 
                   average_amount_agency_3m = mean(amount),
                   min_amount_agency_3m = min(amount),
                   max_amount_agency_3m = max(amount),
                   avg_lag_tnx_3m = mean(lag_transaction),
                   min_lag_tnx_3m = min(lag_transaction)) %>% ungroup()

#Using left_join to tag each agency with the appropriate aggregate variable values
stat_agency <- left_join(stat_agency, stat_agency_day, by = 'agency_name')
stat_agency <- left_join(stat_agency, stat_agency_m, by = 'agency_name')
stat_agency <- left_join(stat_agency, stat_agency_3m, by = 'agency_name')
datatable(head(stat_agency), width = 300)

We will create two more levels of aggregation:

Transactions at each vendor by each agency Transactions at each merchant category by each agency

By creating the same 26 variables (5 all-time statistics plus 7 RFM variables for each of the three time periods) for both these levels of aggregation, we gain further insight into the pattern of purchases for each individual agency-vendor pair and each individual agency-merchant-category pair.

Counting transactions at each aggregation level and each time period allows us to identify anomalous behavior in terms of types of purchases, for example a sudden increase in purchases at a specific vendor or merchant category. The amount and lag variables can be used in the same way.

#Collating data by Agency and merchant(vendor)

#agency x merchant for all time
stat_agency_merchant <- ccd %>% group_by(agency_name, vendor) %>% 
  dplyr::summarise(count_merchant_all = n(), total_amount_merchant_all = sum(amount),
                   avg_amount_merchant_all = mean(amount),
                   min_amount_merchant_all = min(amount), 
                   max_amount_merchant_all = max(amount)) %>% 
  arrange(desc(total_amount_merchant_all)) %>% ungroup()

#Viewing transactions by merchant for Oklahoma State University
plot11 <- stat_agency_merchant %>% 
  filter(avg_amount_merchant_all < 6000 
         & agency_name == "OKLAHOMA STATE UNIVERSITY") %>%
  ggplot(aes(x = avg_amount_merchant_all, y = count_merchant_all, 
             size = total_amount_merchant_all, fill = vendor, color = vendor)) + 
  geom_jitter() + theme_minimal() + xlab("Average Transaction Amount") + 
  ylab("Number of Transactions") + 
  ggtitle("Transactions by vendor for Oklahoma State University") + 
  theme(legend.position = "none")
ggplotly(plot11)
#agency x merchant mutated variables
stat_agency_merchant2 <- ccd %>% group_by(agency_name, vendor) %>% 
  arrange(agency_name, vendor, transaction_date) %>%
  mutate(last_transaction = max(transaction_date)) %>% 
  mutate(gap_transaction = (last_transaction - transaction_date)) %>%
  mutate(lag_transaction = transaction_date - lag(transaction_date)) %>% 
  arrange(gap_transaction) %>% ungroup()
str(stat_agency_merchant2)
## tibble [442,458 x 14] (S3: tbl_df/tbl/data.frame)
##  $ year_month      : Factor w/ 12 levels "201307","201308",..: 1 1 5 3 1 4 6 6 5 5 ...
##  $ agency_number   : Factor w/ 116 levels "1000","2000",..: 29 29 29 29 29 29 29 29 29 29 ...
##  $ agency_name     : chr [1:442458] "`DEPARTMENT OF EDUCATION" "`DEPARTMENT OF EDUCATION" "`DEPARTMENT OF EDUCATION" "`DEPARTMENT OF EDUCATION" ...
##  $ ch_last_name    : chr [1:442458] "Bryan" "Ross" "Bowman" "Bryan" ...
##  $ ch_first_initial: chr [1:442458] "J" "P" "I" "J" ...
##  $ description     : chr [1:442458] "GENERAL PURCHASE" "GENERAL PURCHASE" "GENERAL PURCHASE" "GENERAL PURCHASE" ...
##  $ amount          : num [1:442458] 460 773 105 150 205 ...
##  $ vendor          : Factor w/ 86729 levels "#1 PARTY SUPPLIES",..: 95 499 565 957 981 1124 8010 8017 8481 8641 ...
##  $ transaction_date: Date[1:442458], format: "2013-07-25" "2013-07-26" ...
##  $ posted_date     : Date[1:442458], format: "2013-07-26" "2013-07-29" ...
##  $ MCC             : Factor w/ 435 levels " ","ACCOUNTING,AUDITING AND BOOKKEEPING SERVICES",..: 77 129 55 63 80 56 55 55 24 24 ...
##  $ last_transaction: Date[1:442458], format: "2013-07-25" "2013-07-26" ...
##  $ gap_transaction : 'difftime' num [1:442458] 0 0 0 0 ...
##   ..- attr(*, "units")= chr "days"
##  $ lag_transaction : 'difftime' num [1:442458] NA 3 NA NA ...
##   ..- attr(*, "units")= chr "days"
#agency x merchant at 1 day
stat_agency_merchant_day <- stat_agency_merchant2 %>% filter(gap_transaction == 0) %>%
  group_by(agency_name, vendor) %>% 
  dplyr::summarise(count_agency_merchant_day = n(),
                   total_amount_agency_merchant_day = sum(amount), 
                   average_amount_agency_merchant_day = mean(amount),
                   min_amount_agency_merchant_day = min(amount),
                   max_amount_agency_merchant_day = max(amount),
                   avg_lag_tnx_merchant_day = mean(lag_transaction),
                   min_lag_tnx_merchant_day = min(lag_transaction)) %>% ungroup()

#agency x merchant at 1 month
stat_agency_merchant_m <- stat_agency_merchant2 %>% filter(gap_transaction < 31) %>%
  group_by(agency_name, vendor) %>% 
  dplyr::summarise(count_agency_merchant_m = n(),
                   total_amount_agency_merchant_m = sum(amount), 
                   average_amount_agency_merchant_m = mean(amount),
                   min_amount_agency_merchant_m = min(amount),
                   max_amount_agency_merchant_m = max(amount),
                   avg_lag_tnx_merchant_m = mean(lag_transaction),
                   min_lag_tnx_merchant_m = min(lag_transaction)) %>% ungroup()

#agency x merchant at 3 months
stat_agency_merchant_3m <- stat_agency_merchant2 %>% filter(gap_transaction < 91) %>%
  group_by(agency_name, vendor) %>% 
  dplyr::summarise(count_agency_merchant_3m = n(),
                   total_amount_agency_merchant_3m = sum(amount), 
                   average_amount_agency_merchant_3m = mean(amount),
                   min_amount_agency_merchant_3m = min(amount),
                   max_amount_agency_merchant_3m = max(amount),
                   avg_lag_tnx_merchant_3m = mean(lag_transaction),
                   min_lag_tnx_merchant_3m = min(lag_transaction)) %>% ungroup()

#left joining using agency name and vendor to ensure the right transactions are tagged with the right variable values
stat_agency_merchant <- left_join(stat_agency_merchant, stat_agency_merchant_day,
                                  by = c('agency_name' = 'agency_name', 'vendor' = 'vendor'))
stat_agency_merchant <- left_join(stat_agency_merchant, stat_agency_merchant_m,
                                  by = c('agency_name', 'vendor'))
stat_agency_merchant <- left_join(stat_agency_merchant, stat_agency_merchant_3m,
                                  by = c('agency_name', 'vendor'))
datatable(head(stat_agency_merchant), width = 300)

As can be seen from the graph, most of Oklahoma State University's transactions are low-value. However, there is a significant number of higher-value transactions, and those values do not repeat often.

This observation motivates a new ratio that measures the deviation of the transaction amount from the mean. A high deviation could be a marker of a fraudulent transaction.
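A minimal sketch of such a ratio on toy amounts (the full computation over the feature set follows in the Deviation Ratios section):

```r
# Toy transaction amounts for one agency-vendor pair (hypothetical values)
amounts <- c(100, 120, 95, 110, 2500)

# Ratio of each amount to the group mean; values far from 1 mark
# unusually small or large purchases
dev_ratio <- amounts / mean(amounts)
round(dev_ratio, 2)  # 0.17 0.21 0.16 0.19 4.27
```

The 2,500 purchase stands out at more than four times the group average, exactly the kind of deviation the ratio is designed to surface.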

#Collating data by Agency and merchant category

#agency x merchant category for all time
stat_agency_mcc <- ccd %>% group_by(agency_name, MCC) %>% 
  dplyr::summarise(count_mcc_all = n(), total_amount_mcc_all = sum(amount),
                   avg_amount_mcc_all = mean(amount),
                   min_amount_mcc_all = min(amount), 
                   max_amount_mcc_all = max(amount)) %>% 
  arrange(desc(total_amount_mcc_all)) %>% ungroup()

#viewing transactions by merchant category for oklahoma state university
plot12 <- stat_agency_mcc %>% filter(avg_amount_mcc_all < 6000
                                          & agency_name == "OKLAHOMA STATE UNIVERSITY") %>% 
  ggplot(aes(x = avg_amount_mcc_all, y = count_mcc_all, size = total_amount_mcc_all,
             fill = MCC, color = MCC)) + geom_jitter() + 
  theme_minimal() + xlab("Average Transaction Amount") + ylab("Number of Transactions") +
  ggtitle("Transactions by MCC for Oklahoma State University") + theme(legend.position = "none")
ggplotly(plot12)
#agency x merchant category variable mutations
stat_agency_mcc2 <- ccd %>% group_by(agency_name, MCC) %>% 
  arrange(agency_name, MCC, transaction_date) %>%
  mutate(last_transaction = max(transaction_date)) %>% 
  mutate(gap_transaction = (last_transaction - transaction_date)) %>%
  mutate(lag_transaction = transaction_date - lag(transaction_date)) %>% 
  arrange(gap_transaction) %>% ungroup()
str(stat_agency_mcc2)
## tibble [442,458 x 14] (S3: tbl_df/tbl/data.frame)
##  $ year_month      : Factor w/ 12 levels "201307","201308",..: 7 7 7 7 7 7 7 7 6 4 ...
##  $ agency_number   : Factor w/ 116 levels "1000","2000",..: 29 29 29 29 29 29 29 29 29 29 ...
##  $ agency_name     : chr [1:442458] "`DEPARTMENT OF EDUCATION" "`DEPARTMENT OF EDUCATION" "`DEPARTMENT OF EDUCATION" "`DEPARTMENT OF EDUCATION" ...
##  $ ch_last_name    : chr [1:442458] "Bowman" "Ross" "Ross" "Ross" ...
##  $ ch_first_initial: chr [1:442458] "I" "P" "P" "P" ...
##  $ description     : chr [1:442458] "GENERAL PURCHASE" "AIR TRAVEL" "AIR TRAVEL" "ROOM CHARGES" ...
##  $ amount          : num [1:442458] 1239 556 556 83 83 ...
##  $ vendor          : Factor w/ 86729 levels "#1 PARTY SUPPLIES",..: 78309 15190 15189 21757 21757 21757 21757 21757 8010 1124 ...
##  $ transaction_date: Date[1:442458], format: "2014-01-21" "2014-01-28" ...
##  $ posted_date     : Date[1:442458], format: "2014-01-22" "2014-01-30" ...
##  $ MCC             : Factor w/ 435 levels " ","ACCOUNTING,AUDITING AND BOOKKEEPING SERVICES",..: 4 24 24 51 51 51 51 51 55 56 ...
##  $ last_transaction: Date[1:442458], format: "2014-01-21" "2014-01-28" ...
##  $ gap_transaction : 'difftime' num [1:442458] 0 0 0 0 ...
##   ..- attr(*, "units")= chr "days"
##  $ lag_transaction : 'difftime' num [1:442458] 173 6 0 121 ...
##   ..- attr(*, "units")= chr "days"
#agency x merchant category variables for 1 day
stat_agency_mcc_day <- stat_agency_mcc2 %>% filter(gap_transaction == 0) %>%
  group_by(agency_name, MCC) %>% 
  dplyr::summarise(count_agency_mcc_day = n(),
                   total_amount_agency_mcc_day = sum(amount), 
                   average_amount_agency_mcc_day = mean(amount),
                   min_amount_agency_mcc_day = min(amount),
                   max_amount_agency_mcc_day = max(amount),
                   avg_lag_tnx_mcc_day = mean(lag_transaction),
                   min_lag_tnx_mcc_day = min(lag_transaction)) %>% ungroup()

#agency x merchant category variables for 1 month
stat_agency_mcc_m <- stat_agency_mcc2 %>% filter(gap_transaction < 31) %>%
  group_by(agency_name, MCC) %>% 
  dplyr::summarise(count_agency_mcc_m = n(),
                   total_amount_agency_mcc_m = sum(amount), 
                   average_amount_agency_mcc_m = mean(amount),
                   min_amount_agency_mcc_m = min(amount),
                   max_amount_agency_mcc_m = max(amount),
                   avg_lag_tnx_mcc_m = mean(lag_transaction),
                   min_lag_tnx_mcc_m = min(lag_transaction)) %>% ungroup()

#agency x merchant category variables for 3 months
stat_agency_mcc_3m <- stat_agency_mcc2 %>% filter(gap_transaction < 91) %>%
  group_by(agency_name, MCC) %>% 
  dplyr::summarise(count_agency_mcc_3m = n(),
                   total_amount_agency_mcc_3m = sum(amount), 
                   average_amount_agency_mcc_3m = mean(amount),
                   min_amount_agency_mcc_3m = min(amount),
                   max_amount_agency_mcc_3m = max(amount),
                   avg_lag_tnx_mcc_3m = mean(lag_transaction),
                   min_lag_tnx_mcc_3m = min(lag_transaction)) %>% ungroup()

#left joining new variables with transactions by agency name and MCC
stat_agency_mcc <- left_join(stat_agency_mcc, stat_agency_mcc_day,
                                  by = c('agency_name' = 'agency_name', 'MCC' = 'MCC'))
stat_agency_mcc <- left_join(stat_agency_mcc, stat_agency_mcc_m,
                                  by = c('agency_name', 'MCC'))
stat_agency_mcc <- left_join(stat_agency_mcc, stat_agency_mcc_3m,
                                  by = c('agency_name', 'MCC'))
datatable(head(stat_agency_mcc, n = 25), width = 300)

To summarize: the features created so far are based on the RFM principles and are aggregated at three levels:

  • Agency
  • Agency and Merchant
  • Agency and Merchant Category

And are calculated across three time periods:

  • 1 day
  • 1 month
  • 3 months

And include 7 variables each:

  • Count of transactions
  • avg amount per transaction
  • min amount per transaction
  • max amount per transaction
  • total amount across all transactions
  • avg lag between transactions
  • min lag between transactions

This leads to a total of (3 x 3 x 7) = 63 variables. Let us combine all variables into a single dataframe with all transactions.

ccfactors <- left_join(ccd, stat_agency, by = c('agency_name'))
ccfactors <- left_join(ccfactors, stat_agency_merchant, by = c("agency_name", "vendor"))
ccfactors <- left_join(ccfactors, stat_agency_mcc, by = c("agency_name", "MCC"))
datatable(head(ccfactors), width = 300)

***

Deviation Ratios

To support the RFM variables, we will now create ratios that measure the deviation from the average for:

  • transaction amount
  • lag between transactions

Again, we will do this at 3 levels:

  • Agency
  • Agency and merchant
  • Agency and merchant category

across 3 time periods:

  • 1 day
  • 1 month
  • 3 months
  • All time (only for transaction amount)

#Adding lag parameters for ratio calculation
ccfactors <- ccfactors %>% group_by(agency_name) %>% 
  arrange(agency_name, transaction_date) %>% 
  mutate(lag_agency = transaction_date - lag(transaction_date)) %>% ungroup()

ccfactors <- ccfactors %>% group_by(agency_name, vendor) %>% 
  arrange(agency_name, vendor, transaction_date) %>% 
  mutate(lag_vendor = transaction_date - lag(transaction_date)) %>% ungroup()

ccfactors <- ccfactors %>% group_by(agency_name, MCC) %>% 
  arrange(agency_name, MCC, transaction_date) %>% 
  mutate(lag_mcc = transaction_date - lag(transaction_date)) %>% ungroup()

#Creating ratios
ccfactors <- ccfactors %>%
  mutate(
    dev_tnx_agency_day = (amount/average_amount_agency_day),
    dev_tnx_agency_merchant_day = (amount/average_amount_agency_merchant_day),
    dev_tnx_agency_mcc_day = (amount/average_amount_agency_mcc_day),
    dev_tnx_agency_m = (amount/average_amount_agency_m),
    dev_tnx_agency_merchant_m = (amount/average_amount_agency_merchant_m),
    dev_tnx_agency_mcc_m = (amount/average_amount_agency_mcc_m),
    dev_tnx_agency_3m = (amount/average_amount_agency_3m),
    dev_tnx_agency_merchant_3m = (amount/average_amount_agency_merchant_3m),
    dev_tnx_agency_mcc_3m = (amount/average_amount_agency_mcc_3m),
    dev_tnx_agency_all = (amount/avg_amount_agency_all),
    dev_tnx_agency_merchant_all = (amount/avg_amount_merchant_all),
    dev_tnx_agency_mcc_all = (amount/avg_amount_mcc_all),
    lag_agency_day = (lag_agency/as.numeric(avg_lag_tnx_day)),
    lag_agency_merchant_day = (lag_vendor/as.numeric(avg_lag_tnx_merchant_day)),
    lag_agency_mcc_day = (lag_mcc/as.numeric(avg_lag_tnx_mcc_day)),
    lag_agency_m = (lag_agency/as.numeric(avg_lag_tnx_m)),
    lag_agency_merchant_m = (lag_vendor/as.numeric(avg_lag_tnx_merchant_m)),
    lag_agency_mcc_m = (lag_mcc/as.numeric(avg_lag_tnx_mcc_m)),
    lag_agency_3m = (lag_agency/as.numeric(avg_lag_tnx_3m)),
    lag_agency_merchant_3m = (lag_vendor/as.numeric(avg_lag_tnx_merchant_3m)),
    lag_agency_mcc_3m = (lag_mcc/as.numeric(avg_lag_tnx_mcc_3m))
  )

datatable(head(ccfactors), width = 300)

With these variables, we can measure deviation from the average, which helps us identify anomalies quickly.
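As an illustrative (not definitive) flagging rule on such ratios, using a hypothetical 3x threshold:

```r
# Toy deviation ratios (amount / group average)
dev_tnx <- c(0.9, 1.1, 4.5, 0.2, 0.05)

# Flag transactions more than 3x above or below the group average
flagged <- dev_tnx > 3 | dev_tnx < 1/3
which(flagged)  # 3 4 5
```

In practice the threshold would be tuned per aggregation level and time period rather than fixed at 3.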


Summary

We identified three time periods (1 day, 1 month and 3 months) based on data availability, and three levels of aggregation: agency, agency and vendor, and agency and merchant category. We created 7 RFM variables for each aggregation level and time period, along with deviation ratios for transaction amount (including an all-time version) and for transaction lag at each aggregation level.

This gives us a grand total of 84 variables (RFM: 3 x 3 x 7 = 63; amount deviation ratios: 3 x 4 = 12; lag deviation ratios: 3 x 3 = 9).

We can also explore further variables based on entities (for example, cardholder name and vendor name) and their exposure to fraudulent activity in their network, such as the ratio of fraudulent transactions per 1,000 for each vendor to fraudulent transactions per 1,000 across all vendors.
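Assuming a labeled is_fraud flag were available (it is not present in this dataset), such an exposure ratio could be sketched as:

```r
# Hypothetical labeled transactions; is_fraud does not exist in the real data
tx <- data.frame(
  vendor   = c("V1", "V1", "V1", "V2", "V2"),
  is_fraud = c(1, 0, 0, 0, 0)
)

overall_rate <- mean(tx$is_fraud)                     # fraud rate across all vendors
vendor_rate  <- tapply(tx$is_fraud, tx$vendor, mean)  # fraud rate per vendor
exposure     <- vendor_rate / overall_rate            # >1 means riskier than average
```

Vendor V1 here has an exposure above 1 (riskier than the overall rate), while V2 has no observed fraud.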